Unsupervised 3D Pose Estimation with Geometric Self-Supervision
We present an unsupervised learning approach to recover 3D human pose from 2D
skeletal joints extracted from a single image. Our method requires no
multi-view image data, 3D skeletons, or 2D-3D point correspondences, nor does
it use previously learned 3D priors during training. A lifting network accepts 2D
landmarks as inputs and generates a corresponding 3D skeleton estimate. During
training, the recovered 3D skeleton is reprojected on random camera viewpoints
to generate new "synthetic" 2D poses. By lifting the synthetic 2D poses back to
3D and re-projecting them in the original camera view, we can define a
self-consistency loss both in 3D and in 2D. Training can thus be
self-supervised by exploiting the geometric self-consistency of the
lift-reproject-lift process. We show that self-consistency alone is not
sufficient to generate realistic skeletons; however, adding a 2D pose
discriminator enables the lifter to output valid 3D poses. Additionally, to
learn from 2D poses "in the wild", we train an unsupervised 2D domain adapter
network to allow for an expansion of 2D data. This improves results and
demonstrates the usefulness of 2D pose data for unsupervised 3D lifting.
Results on the Human3.6M dataset for 3D human pose estimation demonstrate that
our approach improves upon previous unsupervised methods by 30% and outperforms
many weakly supervised approaches that explicitly use 3D data.
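The lift-reproject-lift consistency check described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the `lift` function here is a hypothetical stand-in for the learned lifting network, and the camera projection is assumed orthographic.

```python
import numpy as np

def lift(pose2d):
    # Hypothetical lifter: append a constant depth per joint. In the actual
    # method this would be a trained neural network predicting joint depths.
    depth = np.ones((pose2d.shape[0], 1))
    return np.concatenate([pose2d, depth], axis=1)

def project(pose3d):
    # Assumed orthographic projection: drop the depth coordinate.
    return pose3d[:, :2]

def random_rotation(rng):
    # Random rotation about the vertical axis, simulating a new camera view.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def self_consistency_losses(pose2d, rng):
    pose3d = lift(pose2d)                      # lift observed 2D pose to 3D
    R = random_rotation(rng)
    synthetic2d = project(pose3d @ R.T)        # reproject in a random view
    relifted3d = lift(synthetic2d)             # lift the synthetic 2D pose
    back3d = relifted3d @ R                    # rotate back to original view
    loss3d = np.mean((back3d - pose3d) ** 2)             # 3D consistency
    loss2d = np.mean((project(back3d) - pose2d) ** 2)    # 2D consistency
    return loss3d, loss2d

rng = np.random.default_rng(0)
pose2d = rng.standard_normal((17, 2))   # 17 joints, as in Human3.6M
l3d, l2d = self_consistency_losses(pose2d, rng)
```

With a trainable lifter, `loss3d` and `loss2d` would be minimized jointly with the 2D discriminator loss mentioned in the abstract.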
MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Previous research has studied the task of segmenting cinematic videos into
scenes and into narrative acts. However, these studies have overlooked the
essential task of multimodal alignment and fusion for effectively and
efficiently processing long-form videos (>60min). In this paper, we introduce
Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic
long-video segmentation. MEGA tackles the challenge by leveraging multiple
media modalities. The method coarsely aligns inputs of variable lengths and
different modalities with alignment positional encoding. To maintain temporal
synchronization while reducing computation, we further introduce an enhanced
bottleneck fusion layer which uses temporal alignment. Additionally, MEGA
employs a novel contrastive loss to synchronize and transfer labels across
modalities, enabling act segmentation from labeled synopsis sentences on video
shots. Our experimental results show that MEGA outperforms state-of-the-art
methods on the MovieNet dataset for scene segmentation (with an Average
Precision improvement of +1.19%) and on the TRIPOD dataset for act segmentation
(with a Total Agreement improvement of +5.51%).
Comment: ICCV 2023 accepted
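The abstract only names the alignment positional encoding without defining it. One plausible sketch, under the assumption that it means indexing standard sinusoidal encodings by normalized time rather than by token position, so that sequences of different lengths and modalities (e.g. video shots vs. subtitle sentences) share one temporal axis:

```python
import numpy as np

def alignment_positional_encoding(num_tokens, d_model, scale=100.0):
    # Hypothetical sketch: positions are normalized timestamps in [0, 1]
    # (scaled), not integer token indices, so tokens from two modalities
    # occurring at the same relative time get the same encoding regardless
    # of sequence length. `scale` is an assumed hyperparameter.
    t = np.linspace(0.0, 1.0, num_tokens)[:, None] * scale      # (N, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    pe = np.zeros((num_tokens, d_model))
    pe[:, 0::2] = np.sin(t * freqs)
    pe[:, 1::2] = np.cos(t * freqs)
    return pe

pe_video = alignment_positional_encoding(100, 32)  # 100 video shot tokens
pe_text = alignment_positional_encoding(10, 32)    # 10 synopsis sentences
```

Under this construction, the first and last tokens of both sequences receive identical encodings, which is the coarse cross-modal alignment the abstract alludes to.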
Motion-Guided Masking for Spatiotemporal Representation Learning
Several recent works have directly extended the image masked autoencoder
(MAE) with random masking into the video domain, achieving promising results.
However, unlike images, both spatial and temporal information are important for
video understanding. This suggests that the random masking strategy that is
inherited from the image MAE is less effective for video MAE. This motivates
the design of a novel masking algorithm that can more efficiently make use of
video saliency. Specifically, we propose a motion-guided masking algorithm
(MGM) which leverages motion vectors to guide the position of each mask over
time. Crucially, these motion-based correspondences can be directly obtained
from information stored in the compressed format of the video, which makes our
method efficient and scalable. On two challenging large-scale video benchmarks
(Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and
achieve up to + improvement compared to previous state-of-the-art
methods. Additionally, our MGM achieves equivalent performance to previous
video MAE using up to fewer training epochs. Lastly, we show that MGM
generalizes better to downstream transfer learning and domain adaptation tasks
on the UCF101, HMDB51, and Diving48 datasets, achieving up to +
improvement compared to baseline methods.
Comment: Accepted to ICCV 2023
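The core masking idea can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes per-patch motion vectors have already been extracted from the compressed stream (real codecs store them per macroblock), and it simply masks the highest-motion patches in each frame, whereas MGM additionally guides mask positions consistently over time.

```python
import numpy as np

def motion_guided_mask(motion_vectors, mask_ratio=0.75):
    # motion_vectors: (T, H, W, 2) array of per-patch (dx, dy) motion,
    # assumed to come for free from the video's compressed representation.
    T, H, W, _ = motion_vectors.shape
    magnitude = np.linalg.norm(motion_vectors, axis=-1)   # (T, H, W)
    flat = magnitude.reshape(T, -1)
    k = int(mask_ratio * H * W)                           # patches to mask
    mask = np.zeros_like(flat, dtype=bool)
    # For each frame, mask the k patches with the largest motion, i.e. the
    # most salient regions the MAE must reconstruct.
    idx = np.argsort(-flat, axis=1)[:, :k]
    np.put_along_axis(mask, idx, True, axis=1)
    return mask.reshape(T, H, W)

rng = np.random.default_rng(0)
mv = rng.standard_normal((4, 8, 8, 2))   # 4 frames of 8x8 patch motion
mask = motion_guided_mask(mv, mask_ratio=0.75)
```

The returned boolean mask marks which patch tokens are hidden from the encoder; everything codec-specific (parsing motion vectors from H.264/HEVC streams) is assumed away here.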
Genome-Wide Association Data Reveal a Global Map of Genetic Interactions among Protein Complexes
This work demonstrates how gene association studies can be analyzed to map a global landscape of genetic interactions among protein complexes and pathways. Despite the immense potential of gene association studies, they have been challenging to analyze because most traits are complex, involving the combined effect of mutations at many different genes. Due to a lack of statistical power, only the strongest single markers are typically identified. Here, we present an integrative approach that greatly increases power through marker clustering and projection of marker interactions within and across protein complexes. Applied to a recent gene association study in yeast, this approach identifies 2,023 genetic interactions which map to 208 functional interactions among protein complexes. We show that such interactions are analogous to interactions derived through reverse genetic screens and that they provide coverage in areas not yet tested by reverse genetic analysis. This work has the potential to transform gene association studies by elevating the analysis from the level of individual markers to global maps of genetic interactions. As proof of principle, we use synthetic genetic screens to confirm numerous novel genetic interactions for the INO80 chromatin remodeling complex.